9  Data Manipulation - Practical

9.1 Demonstration: Importing, cleaning, and transforming a dataset

I’m going to work through the process of importing, cleaning, and transforming a dataset.

I’ll create a single script that will complete every stage. This means that I can repeat the process, or share it with others.

Load Packages

The script relies on the features of the dplyr library. I load this first, after cleaning my environment.

rm(list=ls())
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

Import Data

Then, I import my data.

url <- "https://www.dropbox.com/scl/fi/a9bad7bh74n9jazge29eg/netball_data_999.csv?rlkey=hattwdksw8ispfi6lymg0ox66&dl=1"
data <- read.csv(url)
rm(url)

This code snippet contains three separate instructions.

First, I create an object in my environment called [url]. This tells R where my datafile is stored.

Then, I use the read.csv function to pull the file located at [url] into my environment. Notice that I do this by creating a new object called [data], which is the output of the function read.csv.

Finally, I remove the object [url] from the environment by calling the rm function.

Inspect the Data

Now, I inspect the data that has been pulled from dropbox. I use the head, str and summary functions to get a sense of the data.

head(data)    # the first six rows ('observations')
  player_id     position goals_scored assists turnovers
1         1 Goal Shooter           25       8         3
2         2  Goal Attack           18      12       999
3         3  Wing Attack           10     999         1
4         4       Center          999      22         5
5         5 Wing Defense            5      10       999
6         6 Goal Defense           12     999         2
str(data)     # the variable types that R has ASSUMED
'data.frame':   10 obs. of  5 variables:
 $ player_id   : int  1 2 3 4 5 6 7 8 9 10
 $ position    : chr  "Goal Shooter" "Goal Attack" "Wing Attack" "Center" ...
 $ goals_scored: int  25 18 10 999 5 12 999 30 15 7
 $ assists     : int  8 12 999 22 10 999 6 14 12 5
 $ turnovers   : int  3 999 1 5 999 2 3 4 999 1
summary(data) # descriptive statistics for each of the variables
   player_id       position          goals_scored       assists     
 Min.   : 1.00   Length:10          Min.   :  5.00   Min.   :  5.0  
 1st Qu.: 3.25   Class :character   1st Qu.: 10.50   1st Qu.:  8.5  
 Median : 5.50   Mode  :character   Median : 16.50   Median : 12.0  
 Mean   : 5.50                      Mean   :212.00   Mean   :208.7  
 3rd Qu.: 7.75                      3rd Qu.: 28.75   3rd Qu.: 20.0  
 Max.   :10.00                      Max.   :999.00   Max.   :999.0  
   turnovers     
 Min.   :  1.00  
 1st Qu.:  2.25  
 Median :  3.50  
 Mean   :301.60  
 3rd Qu.:750.50  
 Max.   :999.00  

Dealing with missing data

Data is often ‘messy’. We’ll cover how to deal with this in more detail in the next few weeks, but for now, assume that we want to represent missing data with the value NA.

In this dataset, ‘999’ represents an missing value.

First, we’ll replace these values with ‘NA’.

Note that R will do this automatically if the element is truly missing.

# Replace 999 with NA in the dataset
data[data == 999] <- NA

The simplest approach to dealing with missing data is to remove any observations with missing data (‘NA’) in any of the columns. We can use the na.omit function to do this.

# Remove observations with missing data
data_clean <- na.omit(data)

Notice that, following good practice, I haven’t overwritten the original data frame. Rather, I’ve created a new data frame that is a ‘clean’ version of the old one.

Removing variables

Datasets often contain variables that we don’t need. To keep things tidy, I usually delete anything I won’t be using. The advantage of doing this in a script is that, if I change my mind later, I can just edit the script to retain the variable or variables.

# Remove variable [turnovers]
data_clean$turnovers <- NULL

# I'm going to reintroduce [turnovers]

turnovers_temp <- na.omit(data$turnovers)

extracted_players <- inner_join(data, data_clean, by = "player_id")

extracted_players$position.y <- NULL
extracted_players$goals_scored.y <- NULL
extracted_players$assists.y <- NULL

data_clean <- extracted_players

rm(extracted_players)

Renaming variables

Often, we want to rename variables, especially if the dataset has been provided by an external organisation.

In the following example, I change the name of the variable [position] to [player_position], [assists] to [player_assists] and [turnovers] to [player_turnovers]:

# Rename variables
names(data_clean) <- c("player_id", "player_position", "goals_scored", "player_assists", "player_turnovers")

Changing variable types

Last week, we talked about the importance of making sure that R understands the types of variables in our dataset.

In particular, anything that is going to be a grouping variable needs to be identified as a factor.

In our example, [player_position] is a factor, as we can use it to group players into different categories.

# Change variable types
data_clean$player_position <- as.factor(data_clean$player_position)

Then, I can use the str function to check variable types. Notice that R has identified that there are two levels of the factor [player_position].

# Use the 'str' function to check variable types
str(data_clean)
'data.frame':   3 obs. of  5 variables:
 $ player_id       : int  1 8 10
 $ player_position : Factor w/ 2 levels "Goal Shooter",..: 1 1 2
 $ goals_scored    : int  25 30 7
 $ player_assists  : int  8 14 5
 $ player_turnovers: int  3 4 1

Creating new variables based on operations

I’ll now move to creating a new variable based on something done to existing variables.

Calculate a new variable, [data_clean$ratio], which is [goals_scored]/[player_assists].

# Calculate 'ratio' as [goals_scored] divided by [player_assists]
data_clean$ratio <- data_clean$goals_scored/data_clean$player_assists

I might want to round this variable:

data_clean$ratio <- round(data_clean$ratio,2) # round the new variable to 2 decimal places
head(data_clean)
  player_id player_position goals_scored player_assists player_turnovers ratio
1         1    Goal Shooter           25              8                3  3.12
2         8    Goal Shooter           30             14                4  2.14
3        10    Wing Defense            7              5                1  1.40

Write the data as a CSV file

Finally, I’ll Write the dataset as a .csv file to an appropriate location.

write.csv(data_clean,"my_netball_data.csv")

9.2 Activity: Importing, cleaning, and transforming a dataset

Now, create a script that conducts all of the following steps.

Begin by including the following snippet, which directs R to the file location:

url <- "https://www.dropbox.com/scl/fi/w6cmx5fgl6y5e1sizeskf/t08_data_b1700_01.csv?rlkey=trab5xa3hpqhf01ruhx1w2ers&dl=1"
data <- read.csv(url)
rm(url)

Working on the dataframe ‘data’, and with the dplyr library installed and loaded:

  1. Use the head, str and summary commands to overview the dataset.
  2. In this dataset, ‘999’ represents missing data. Replace these values with ‘NA’.
  3. Remove any observations with missing data (‘NA’).
  4. Remove the variable [X] from the dataset.
  5. Rename each of the variables to the format [var_id], [var_a] etc.
  6. Change the variable types as follows: [var_id] = factor, [var_a] = factor.
  7. Use the ‘str’ function to check variable types.
  8. Calculate a new variable, [var_e], which is the sum of [var_c] + [var_d].
  9. Write the dataset as a .csv file to an appropriate location where you might be able to access it at a later date.

Possible solutions

Use the head, str and summary commands to overview the dataset.

Show the answer
# Overview the dataset
head(data)
str(data)
summary(data)

In this dataset, ‘999’ represents an outlier. Replace these values with ‘NA’.

Show the answer
# Replace 999 with NA in the dataset
data[data == 999] <- NA

Remove any observations with missing data (‘NA’).

Show the answer
# Remove observations with missing data
data_clean <- na.omit(data)

Remove the variable [X] from the dataset.

Show the answer
# Remove variable 'X'
data_clean$X <- NULL

Rename each of the variables to the format [var_id], [var_a] etc.

Show the answer
# Rename variables
names(data_clean) <- c("var_id", "var_a", "var_b", "var_c", "var_d")

Change the variable types as follows: [var_id] = factor, [var_a] = factor.

Show the answer
# Change variable types
data_clean$var_id <- as.factor(data_clean$var_id)
data_clean$var_a <- as.factor(data_clean$var_a)

Use the str function to check variable types.

Show the answer
# Use the 'str' function to check variable types
str(data_clean)

Calculate a new variable, [var_e], which is the sum of [var_c] + [var_d].

Show the answer
# Calculate 'var_e' as the sum of 'var_c' + 'var_d'
data_clean$var_e <- data_clean$var_c + data_clean$var_d

9.3 Demonstration: Creating and manipulating various data structures

Preparation

First, I load the tidyverse library and clear my environment.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.5
✔ ggplot2   3.5.1     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
rm(list=ls())

Then, I create a simple dataframe that contains the statistics of a hypothetical football player:

football_df <- data.frame(
    player_name = c("John", "Mike", "Lucas", "Eva"),
    goals_scored = c(5, 10, 3, 6),
    assists = c(4, 6, 2, 3)
    )
print(football_df)
  player_name goals_scored assists
1        John            5       4
2        Mike           10       6
3       Lucas            3       2
4         Eva            6       3

Convert the dataframe to a tibble and print it.

football_tibble <- as_tibble(football_df)
print(football_tibble)
# A tibble: 4 × 3
  player_name goals_scored assists
  <chr>              <dbl>   <dbl>
1 John                   5       4
2 Mike                  10       6
3 Lucas                  3       2
4 Eva                    6       3

Adding observations and variables

I can add a new player’s statistics to the dataframe and a new column for [games_played] using the rbind function.

football_df <- rbind(football_df, data.frame(player_name = "Sophia", goals_scored = 7, assists = 5))

I can add a new variable for all players (observations) called [games_played]

football_df$games_played <- c(10, 12, 9, 11, 10)
print(football_df)
  player_name goals_scored assists games_played
1        John            5       4           10
2        Mike           10       6           12
3       Lucas            3       2            9
4         Eva            6       3           11
5      Sophia            7       5           10

Filtering the data

Using dplyr, filter out players who’ve scored more than 5 goals.

top_scorers <- football_df %>% filter(goals_scored > 5)
print(top_scorers)
  player_name goals_scored assists games_played
1        Mike           10       6           12
2         Eva            6       3           11
3      Sophia            7       5           10

Operating on the data

Calculate a new column [goal_per_game] and get the average goals scored by the team.

football_df <- football_df %>%
      mutate(goal_per_game = goals_scored / games_played)
avg_goals <- football_df %>%
      summarise(mean_goals = mean(goals_scored))

print(football_df)
  player_name goals_scored assists games_played goal_per_game
1        John            5       4           10     0.5000000
2        Mike           10       6           12     0.8333333
3       Lucas            3       2            9     0.3333333
4         Eva            6       3           11     0.5454545
5      Sophia            7       5           10     0.7000000
print(avg_goals)
  mean_goals
1        6.2

Sorting the data

Arrange players by goals scored in descending order.

sorted_df <- football_df %>%
  arrange(desc(goals_scored))
print(sorted_df)
  player_name goals_scored assists games_played goal_per_game
1        Mike           10       6           12     0.8333333
2      Sophia            7       5           10     0.7000000
3         Eva            6       3           11     0.5454545
4        John            5       4           10     0.5000000
5       Lucas            3       2            9     0.3333333

Grouping observations

Group players by position and get the total goals scored for each position.

football_df$position <- c("Forward", "Midfielder", "Defender", "Forward", "Midfielder")
position_goals <- football_df %>%
  group_by(position) %>%
  summarise(total_goals = sum(goals_scored))
print(position_goals)
# A tibble: 3 × 2
  position   total_goals
  <chr>            <dbl>
1 Defender             3
2 Forward             11
3 Midfielder          17

9.4 Activity: Creating and manipulating various data structures

The following activity allows you to practise some of the techniques covered above.

You may also have to do some research to find out how to complete some of the challenges!

  1. Install and load necessary packages (e.g. tidyverse)

  2. Download the data from a url, and create a dataframe called [data].

https://www.dropbox.com/scl/fi/ank5v48ndxqggw6mqx6dh/football_data_999_with_position.csv?rlkey=2zqwk0tfxkq3mc6ikvxoyu8r6&dl=1

  1. Convert 999 values (missing data) to NA.

  2. Create a new dataframe called [data_clean] by removing observations with missing data.

  3. Convert the above dataframe to a tibble called [football_tibble] and print it.

  4. Delete all objects apart from [football_tibble].

  5. Add a new player’s statistics to [football_tibble].

  6. Using dplyr, filter out players who’ve scored fewwer than 16 goals.

  7. Calculate a new variable [goal_per_game] and get the average goals scored by the team.

  8. Arrange players by goals scored in descending order.

  9. Finally, group players by position and get the total goals scored for each position.

Solutions

Install and load necessary packages.

Show the answer
rm(list=ls())
library(tidyverse)

Download data

Show the answer
url <- "https://www.dropbox.com/scl/fi/ank5v48ndxqggw6mqx6dh/football_data_999_with_position.csv?rlkey=2zqwk0tfxkq3mc6ikvxoyu8r6&dl=1"
data <- read.csv(url)
rm(url)

Dealing with missing data

Show the answer
# Replace 999 with NA in the dataset
data[data == 999] <- NA
# Remove observations with missing data
data_clean <- na.omit(data)

Convert the dataframe to a tibble and print it.

Show the answer
football_tibble <- as_tibble(data_clean)
print(football_tibble)

Delete all objects apart from the tibble

rm(data, data_clean)

Add a new player’s statistics to the tibble.

Show the answer
football_tibble <- rbind(football_tibble, data.frame(player_name = "Sophia", goals_scored = 7, assists = 5, player_position = "Forward"))
print(football_tibble)

Using dplyr, filter out players who’ve scored fewer than 16 goals.

Show the answer
top_scorers <- football_tibble %>% filter(goals_scored > 15)
print(top_scorers)

Create new variable

Calculate a new column [goals_and_assists] which is the sum of [goals_scored] and [assists], and get the average goals scored by the team.

Show the answer
football_tibble <- football_tibble %>%
      mutate(goal_per_game = goals_scored + assists)
avg_goals <- football_tibble %>%
      summarise(mean_goals = mean(goals_scored))

print(football_tibble)
print(avg_goals)

Arrange players by goals scored in descending order.

Show the answer
sorted_df <- football_tibble %>%
  arrange(desc(goals_scored))
print(sorted_df)

Group players by position and get the total goals scored for each position.

Show the answer
position_goals <- football_tibble %>%
  group_by(player_position) %>%
  summarise(total_goals = sum(goals_scored))
print(position_goals)